Synopsis: ID Required
Get introduced to another antipattern that specifies the importance of using specific keys in the database.
We'll cover the following
Recently, a software developer asked me a question about preventing duplicate rows. It’s a question I get asked frequently and the first response that comes to mind is to ask the developer to check if she lacks a primary key. But in this case, this wasn’t the problem.
The developer in this case had stored articles for publishing on a website in his content management database. He had used an intersection table for a many-to-many association between a table of articles and a table of tags.
He was getting incorrect results from queries when counting the number of articles with a given tag. He knew that there was only one article with a particular tag, but the query was telling him that there were three.
When he queried all the rows matching that tag_id
, he saw that the tag was
associated with one particular article thrice; three rows showed the
same association, although they had different values for id
.
After seeing the details, we came to know that the values for article_id
and tag_id
are repeating three times.
Do I really need a primary key?
This table did have a primary key. However, that primary key wasn’t preventing duplicates in the columns that mattered. One remedy might be to create a UNIQUE
constraint over the other two columns, but, even then, the question remains:, why is the id
column needed at all?
Objective: Establish primary key conventions#
The objective is to make sure that every table has a primary key, but confusion about the nature of a primary key has resulted in an antipattern.
Everyone who has been introduced to database design knows that a primary key is an important, even mandatory, part of a table. This is true; primary keys are integral to good database design. A primary key is guaranteed to be unique over all rows in the table, so this is the logical mechanism to address individual rows and to prevent duplicate rows from being stored. A primary key is also referenced by foreign keys to create table associations.
The tricky part is choosing a column to serve as the primary key. The value of any attribute in most tables has the potential to belong in more than one row. Textbook examples such as a person’s first name and last name are clearly subject to having duplicates. Even email addresses or administrative identification numbers such as a United States Social Security number or taxpayer ID number aren’t strictly unique. A new column is needed in such tables to store an artificial value that has no meaning in the domain modeled by the table. This column is used as the primary key, so we can address rows uniquely while allowing any other attribute column to contain duplicates if it is needed. This type of primary key column is sometimes called a pseudo key or a surrogate key.
In order to make sure that rows can be given unique pseudo key values (even when concurrent clients are inserting new rows), most databases provide a mechanism to generate unique integer values serially outside the scope of transaction isolation.
Pseudo keys weren’t standardized until SQL:2003, so each database uses its own extension to SQL to implement them. Even the terminology for pseudo keys is vendor-dependent, as shown by the following table:
Features | Supported by Database Brands |
---|---|
AUTO_INCREMENT |
MySQL |
GENERATOR |
Firebird, InterBase |
IDENTITY |
DB2, Derby, Microsoft SQL Server, Sybase |
ROWID |
SQLite |
SEQUENCE |
DB2, Firebird, Informix, Ingres, Oracle, PostgreSQL |
SERIAL |
MySQL, PostgreSQL |
Pseudo keys are a useful feature, but they aren’t the only solution for declaring a primary key.
Legitimate uses of the antipattern#
Some object-relational frameworks simplify development by assuming convention over configuration. These frameworks expect every table to define its primary key in the same way — as an integer pseudo key column named id
. It is best to conform to these conventions when we are using such a framework because this gives us access to other desirable features of the framework.
There’s also nothing wrong with using a pseudo key or assigning values from
an auto-incrementing integer mechanism. But not every table really needs a
pseudo key, and it’s not necessary to name every pseudo key id
.
A pseudo key is a good choice as a surrogate for a natural key that’s too long to be practical. For example, for a table that records attributes of a file on the filesystem, the path of the file might be a good natural key, but it would be costly to index a string column that long.